Whole-genome resequencing of Sorghum bicolor and S. bicolor × S. halepense lines provides new insights for improving plant agroecological characteristics

Sorghum (Sorghum bicolor L. (Moench)) is the world’s fifth economically most important cereal and is a staple particularly in the semi-arid tropics of Africa and Asia. Genetic gains in this crop can benefit from wild relatives such as Sorghum halepense. Genome sequences including those from this wild species can boost the study of genome-wide and intraspecific variation for dissecting the genetic basis and improving important traits in sorghum. The whole-genome resequencing carried out in this work on a panel of 172 populations of S. bicolor and S. bicolor × S. halepense (SbxSh) advanced lines generated a total of 567,046,841 SNPs, 91,825,474 indels, 1,532,171 SVs, and 4,973,961 CNVs. Clearly, SbxSh accumulated more variants and mutations with powerful effects on genetic differentiation. A total of 5,548 genes private to SbxSh mapped to biological process GO enrichment terms; 34 of these genes mapped to root system development (GO: 0022622). Two of the root specific genes i.e., ROOT PRIMORDIUM DEFECTIVE 1 (RPD1; GeneID: 8054879) and RETARDED ROOT GROWTH (RRG, GeneID: 8072111), were found to exert direct effect on root growth and development. This is the first report on whole-genome resequencing of a sorghum panel that includes S. halepense genome. Mining the private variants and genes of this wild species can provide insights capable of boosting sorghum genetic improvement, particularly the perenniality trait that is compliant with agroecological practices, sustainable agriculture, and climate change resilience.


Results
Genetic diversity and sequencing assessment. One hundred and seventy-two sorghum lines evaluated in this study clustered in two different populations of 153 S. bicolor genotypes and 19 S. bicolor × S. halepense recombinant inbred lines (Fig. 1). The Fstatistic (Fst) measuring the genetic structure and the level of genetic differentiation 28,29 was significantly and moderately high (Fst = 0.31, p = 0.01); Fst values range from 0 in case of panmixis to 1 in case the populations do not share any genetic diversity. The first two dimensions of the principal coordinates explained 70.9% of the total genetic diversity existing in the studied populations. The one hundred and seventy-two sorghum lines were whole-genome resequenced and a corresponding number (172) of paired-end sequencing libraries constructed each with the insert size around 300 base pairs. Resequencing yielded 22.88 billion paired-end reads resulting in 3.43 trillion bases (nucleotides) and 2.6 TB of high-quality raw data. Ultimately, a total of 21.70 billion and 3.25 trillion of clean paired-end reads and bases were produced, respectively. Overall, 94.54% of total clean reads showed quality value Q20 ≥ 94.54%, which indicates a high data quality. Sorghum bicolor (Sb) and S. bicolor × S. halepense (SbxSh) showed comparable clean reads quality (Q20 = 96.17-96.38% and Q30 = 87.91-88.22%), number of clean reads (125.7-126.19 × 10 6 ) and bases (18.82-18.89 × 10 9 ), and the ratio clean bases: raw bases (94.64-94.89%). However, the guanine-cytosine (GC) rate was higher in SbxSh relative to Sb (43.42% a vs. 43.26% b) in the clean data.
Sequence alignment. Alignment assessment. Sequence reads were aligned to the Sorghum bicolor (BTx623) reference genome whose size was 732,200,000 bp, while the effective size was 675,973,270 bp (N base excluded), and GC content of 41.82%. The mapping rate i.e., the percentage coverage of the reference by reads of the sorghum samples varied from 89.15 to 95.18% with a mean of 92%. The percentage of mapped reads and that of mapped bases were identical and ranged from 82.65 to 99.92% with an average of 99.40%. The effective mapping depth i.e., LN/G, where L is the read length, N is the number of reads and G is the haploid genome length, was between 23.17X to 34 www.nature.com/scientificreports/ SNP calling and annotation. The sequence alignment of the target sorghum lines to the reference genome BTx623, the gene models, and the information derived from the reference genome allowed to identify large numbers of SNPs, InDels, CNVs, and SVs (Table 1; Fig. 2). A total of 567,046,841 SNPs was uncovered from these sorghum genomes. On average, 10,515,367.4 SNPs were observed per individual, of which 1,855,062 were located in the genic regions. The statistical analysis showed that SbxSh had more total and heterozygous SNPs, synonymous_CDS, nonsyn_CDS, exonic, genic, intronic, mRNA, pseudogenic, transcript, and tRNA SNPs, whereas Sb showed more homozygous SNPs. Among the synonymous and nonsynonymous SNPs mapped in the coding regions in either population, the synonymous SNPs represented 49% and 54%, respectively, in Sb and SbxSh populations. The SbxSh population contained 81% and 78%, respectively, of all synonymous and nonsynonymous SNPs. Contrary to synonymous mutations, nonsynonymous mutations cause variation in coding amino acids and are considered to play a significant role in changing the phenotype of organisms. Besides, nonsynonymous mutations are also strong candidates to explain the phenotypic diversity between different individuals in a population. We further analyzed the distribution of the large-effect SNPs i.e., those with a potential to disable gene functions 26,30 . In this work, large-effect SNPS included premature stop codon, stop codon to non-stop codon, start codon to non-start codon, and splice sites. It was found that within the 10,140 SNPs participating in codon premature termination, 2,970 SNPs disrupt splicing donor or acceptor sites of genome, 13,976 SNPs are related to alteration of initiation methionine residues, and 1,144 SNPs replace terminators with certain amino acid residues that leads to longer ORFs. The statistics are depicted in Fig. 3 where SbxSh showed higher numbers of large-effect SNPs than Sb. In SbxSh population, it was found an average of 1967, 500, 370, and 700 SNPs expected to induce premature stop codon, stop codon to non-stop codon, start codon to non-start codon, and splice sites, respectively, while in Sb it was found a maximum of 1183, 340, 100, and 340, respectively. In both populations, SNPs inducing premature stop codon were most represented with respect to the other large-effect SNPs.  Genetic variation in S. bicolor and S. bicolor × S. halepense hybrids. One of our working hypotheses was that some of the identified genetic variation might contribute to the phenotypic differentiation between S. bicolor and S. bicolor × S. halepense which pushed us to focus our analysis on SNPs in genic regions. Sorghum breeders working on the introgression of perenniality in otherwise annual S. bicolor background select for overall sorghum bicolor plant aspect in addition to the overwintering trait. We also observed from our experience in recent years 14 that backcrosses are the most attractive as they show traits closely comparable to domesticated sorghum (panicle shape, seed size, etc..) than single, double, or three-way crosses. We therefore used the data produced in this and previous studies from our laboratory to investigate the contribution of Sorghum halepense in SbxSh controlled hybridizations. As Fig. 2 shows, SbxSh9 backcross line has more genes and more SNPs, CNVs, and indels than S. bicolor Sb1; nonetheless, the two lines displayed comparable number of SVs. The same pattern was observed across populations. The density of genes, SNPs, indels, and SVs increases from the pericentromeric region towards the telomeres, with SNPs/genes and short indels showing similar distribution pattern. In both populations, the distribution of CNVs was homogeneous from the centromeres to telomeres in all chromosomes. In addition, sorghum biomass related significant SNPs and candidate genes recently uncovered 8 in these populations i.e., Dw genes (Dw1, Dw2, Dw3, Dw4), Ma genes (Ma1, Ma2, Ma3, Ma5, Ma6), gibberellin (GA) associated genes (SbGA2ox1, SbGA3ox1, SbGA2ox7), genes involved in controlling heading date (SbZCN8) 1,31 and GA signaling and plant height regulation (SbSLR1 1 ) were mostly localized towards the distal and proximal ends of the chromosomes of interest (Fig. 2). The analysis of genes harboring SNPs showed more genes (18,785) were shared between Sb and BC1 crosses, with relatively fewer genes i.e., 109, 230, and 291 being private to the respective three lines Sb1, SbxSh50, and SbxSh9 (Fig. 4). The analysis of the biological processes (BP) GO enrichment associated to SbxSh genes showed 33,693, and 5548 genes in the Sorghum bicolor reference genome dataset and in SbxSh9/SbxSh5 hybrids gene www.nature.com/scientificreports/ set that mapped to the GO terms (either directly or through inheritance), respectively ( Table 2). The most granular terms included enrichment in genes associated to the plant-type cell wall organization (

Discussion
The development of perennial sorghum initiated in 2015 in our breeding program; earlier morpho-agronomic performances of evaluated materials was reported in our previous works 14 . According to our experience and available literature 19 , the production of rhizomes is the sine qua non condition for SbxSh lines to remain perennial under temperate climates like the prevalent conditions in our Italian experimental stations 14 . One of the most interesting findings from recent studies is the absence of negative trade-off between rhizome development and seed yield and aboveground biomass yield. This is expected to allow the development of high-yielding biomass, grain, and dual purpose sorghum ideotypes expressing perennating belowground structures 14,19 . Sorghum bicolor × S. halepense perennial lines showed competitiveness relative to commercial hybrids in terms of above-ground biomass production and grain yields; however mostly backcross-derived lines showed overall plant aspect 27 and domestication traits such as high seed yield, big caryopses, seed shattering resistance, compact inflorescence and stalk strength closer to S. bicolor than other crosses 14,20 .
Grain yield was significantly correlated with maturity, dry mass yield, dry mass fraction of fresh material, number of culms, rhizome development, hemicellulose, and rhizome survival but important correlation coefficients were observed for maturity, number of culms and rhizome development. The traits that showed significant medium to high 32 correlation with aboveground dry mass yield included plant height, dry mass fraction of the fresh material, number of culms, neutral detergent fiber.
In this work, we completed the first whole-genome resequencing analysis of a unique panel made up of Sorghum bicolor and S. bicolor × S. halepense populations. Our resequencing focused on S. bicolor × S. halepense recombinant inbred lines (RILs) instead S. halepense per se and this can be explained by our effort to align the resequencing experiment with our breeding program to develop perennial sorghum cultivars 14 . The SbxSh RILs were developed through crossing and selection to minimize the S. halepense associated linkage drag e.g., seed shattering, small-sized kernels, excessive tillers and rhizomes, and undesirable inflorescence compactness and shape in our breeding populations. The resequencing was therefore expected to explain the parental lines contributions to the genomic composition of the perennial hybrid combinations. The use of wild relatives in genetic introgressions is generally accompanied by linkage drag associated with the introduction of unfavorable traits along with the favorable ones 33 , and this necessitates a significant and time-consuming breeding effort to recover the domesticated phenotype, particularly when the primary produce is the grain 18,34 . This resequencing experiment is important in plant breeding, particularly in sorghum. Novel variants will be used for gene discovery, while a great number of uncovered high-quality polymorphisms will be harnessed in the process of genomic selection, genome-wide association studies, and marker-assisted selection. As these are founder populations for our entire breeding program, such opportunity the resequencing offered cannot be overemphasized 35 , particularly in terms of increased genomic predictions and mapping precision of quantitative trait loci. Furthermore, no genotyping platforms e.g., arrays, chips as those in use in other crop species e.g., tomato, potato or pepper, have been developed so far in sorghum for high-throughput genotyping of sorghum traits, particularly those associated with perenniality 26 . The major aim of this work was therefore to develop a large repertoire of genomic information and polymorphism data sets that can be used for gene discovery and validation, and as a source of markers to build genotyping platforms for applied breeding purposes.
In this work, a total of 21.70 billion and 3,25 trillion of clean paired-end reads and bases were produced, respectively. Overall, 94.54% of total clean reads showed quality value Q20 ≥ 94.54%; such quality value exceeded 96% when calculated in S. bicolor and S. bicolor × S. halepense separately, which indicates a high data quality. The mapping rate i.e., the percentage coverage of the reference by reads of the sorghum samples varied from 89.15% to 95.18% with a mean of 92%, while the percentage of mapped reads ranged from 82.65% to 99.92% with an average of 99.40%, indicating a high sequencing accuracy and the absence of contaminating DNA. The effective mapping depth was between 23.17X to 34 www.nature.com/scientificreports/ depth obtained in this study was higher than in most previous mapping experiments that showed values around 10X 36,37 , and encompassed the entire reference genome length in homogeneous pattern in all accessions. The mapping rate and the percentage of mapped reads realized in this study were better than reported in previous works, and confirm the high quality of the reference sequence used. For instance, Gramazio et al. 26 reported a mean mapping rate of 85.4% with a range from 76.9 to 88.7%. On the other hand, in some model species, the average rates of unmapped reads were higher e.g., 3-5% in tomato 38,39 and 10-15% in rice 40,41 . Differences in mapping experiments can be attributed to a variety of factors including differences in: (1) the progress of the sequence assembly, (2) the levels of repetitive elements, (3) genetic divergence between the sequenced samples and the reference genome, and (4) the levels of variants polymorphisms 38,41 . Sorghum bicolor × S. halepense and S. bicolor populations showed comparable mapped reads, mapped bases, sequencing depth after mapping, while Sorghum bicolor × S. halepense showed statistically significant higher percentage coverage rate (94.87a vs. 91.62b), yet lower unique percentage hit bases (81.96a vs. 77.58b) and unique percentage hit reads (82.39a vs. 78.31b). Since Sorghum bicolor (Sb) and S. bicolor × S. halepense showed comparable clean reads quality (Q20 and Q30), and the number of clean reads and bases, the higher coverage rate observed in S. bicolor × S. halepense can be attributed to the lower rates of uniquely mapped reads and hence the existence of reads mapping to multiple reference genomic loci with low level of sequence similarity with the target sequence. The existence of multi-mapped reads in S. bicolor × S. halepense can be explained by this population producing statistically higher number of short indels, insertions of long fragments (at least 50 bp), CNVs, CNV up-regulations and down-regulations than S. bicolor 42 . A relatively lower level of sequence similarity was expected between S. bicolor reference sequence and S. bicolor × S. halepense in virtue of the genetic distance that existed between the two genomes ( Fig. 1) deriving mainly from S. halepense. Under such circumstances, several authors 39 pointed out the need for sequencing and assembly several reference genomes from crop wild relatives to avoid biased resequencing analyses and to improve the rate of uniquely mapped reads. However, since the dynamics of gene gains and losses during plants evolution and particularly during the interploid hybridization between S. bicolor and S. halepense is not yet fully understood, other reasons may explain the higher genome coverage of S. bicolor × S. halepense.
The SbxSh population showed a greater degree of heterozygosity (Table 1), which is consistent with previous genetic analyses 34 , and can be explained by the genetic history of S. bicolor that underwent the bottleneck of domestication, resulting in a narrowing of the genetic base with respect to wild species, while the tetraploid nature of S. halepense and its progeny may have played a major role in the heterozygosity observed in the SbxSh population; the fixation of alleles requires a higher number of generations in polyploids, and heterozygosity decreases slowly even in the presence of repeated cycles of self-fertilization 43 . Furthermore, the whole-genome resequencing reads were aligned to the S. bicolor reference genome 13 ; alignment of sequences from an allotetraploid SbxSh to a diploid genome can result in an overestimation of heterozygous loci due to alignment of homeologs. In S. halepense homeologs derived from orthologs in the genomes of its ancestors (S. bicolor and S. propinquum) are conserved, but following S. bicolor × S. halepense hybridization it becomes difficult to predict the fate of such homeologs across generations because of different possibilities of chromosome pairing and independent assortment at meiosis 44 . It is nonetheless expected that at least some of the homeolog chromosome pairs can be maintained and contribute to increasing the heterozygosity of the recombinant inbred lines.
Our study has identified a large set of polymorphisms, consisting of 665,378,447 of high-quality variants including SNPs, indels, SVs, and CNVs; SNPs represented 85.22% of all variants, which is in agreement with previous works 26 . The identification of more SNPs in the present sorghum panel represents a good breeding opportunity as these markers are cheaper and easy to automate for high-throughput genotyping with respect to other markers 45,46 . The whole-genome resequencing completed in this work is therefore the starting point to develop a large number of markers not only in S. bicolor but also in Sorghum wild relatives such as Sorghum halepense, for which the lack of such information slowed down their use in breeding programs 39,47 . There are success stories on the opportunity to harness variants polymorphisms from crop wild relatives e.g., in soybean 37,48 , rice 41 , and tomato 39,49 , and eggplant 26 . In our study, SbxSh population produced more variants than Sb population, which confirmed the findings in previous works showing that crop wild relatives yield more variations relative to landraces or cultivated accessions. Our study highlights therefore the possibility for a controlled introgression of the variation from S. halepense to broaden the genetic basis of S. bicolor; similar introgressions were achieved in other crops e.g., rice, tomato, and wheat 50 . To our best knowledge, our study represents the first effort to harness the valuable large pool of genetic diversity from S. halepense using whole genome resequencing. Similar panels were evaluated in previous studies but relied on genotyping-by-sequencing platforms that showed technical limitations particularly associated with very low sequencing depth (~ 1.5X) and poor coverage 8,34 . Examples of such previous genotyping-by-sequencing-based investigations include linkage disequilibrium studies on biomass and biomass-related trait in sorghum 8 , antioxidant traits in sorghum 34 , and Sorghum plant architecture 51 . In addition, Habyarimana and Lopez used genotyping-by-sequencing SNPs to carry out genomic prediction and selection in sorghum 52,53 . The information produced in this work and the variants identified from S. halepense are expected to accelerate the introgression of perenniality and other useful genomic regions of this rhizome-producing species 18 to develop superior climate/resilient and agroecological practice compliant sorghum cultivars. A large number of high-confidence polymorphisms was also identified in S. bicolor population and will be harnessed for highthroughput genotyping in cultivated or wild sorghum species using high-throughput genotyping platforms e.g., arrays or chips. In this work, genotypes were called individually for each sample for all variants but, for SNPs, we also performed joint genotyping across samples to produce a multi-sample VCF call-set for further investigations. The multi-sample VCF call-set produced vcf files of 33 and 6 Mb of SNPs and indels, respectively, with a good coverage and sequencing depth. These matrices will be used to provide more insights and improving previous studies particularly in domestication, genomic predictions, genome-wide association studies, and phylogenetics 26  www.nature.com/scientificreports/ The Fst statistics which is a metric of population structure, confirmed previous studies 8 showing that Sb and SbxSh form two distinct populations. In addition, the differentiation between the two populations is supported by the higher number of variants observed in the SbxSh population, particularly SNPs, large-effect SNPs, CNVs, SVs, indels and frameshift mutations. The Fst achieved in this work was higher than or comparable to previous reports 8,54 ; the observed Fst discrepancies can be attributed to differences in the number of markers used, population genetic diversity, and in the sampling approaches implemented in these works. An outlier (SbxSh102) was identified that is genetically closer to S. bicolor population. The SbxSh102 consists of two doses of S. bicolor recurrent parent (Tx623) in the S. bicolor × S. halepense (Gypsum 9) controlled cross and this perennial RIL is of future genetics and breeding interests in the development of perennial Sorghum bicolor ideotypes.
Our study produced genes and variants with a higher density that better covered entire lengths of individual chromosomes than in previous works 52,55,56 . The density of genes, SNPs and indels showed similar chromosomal distribution pattern, increasing from the pericentromeric region towards the telomeres. Such better variants distribution pattern is expected to boost novel gene and major marker discovery. In previous works 8,31,57 , uncovered significant SNPs and candidate genes were mostly localized towards the distal and proximal ends of the chromosomes of interest. Our whole-genome resequencing work produced high density of marker variants covering entire genome and offers therefore the opportunity to uncover novel major marker variants and genes in the pericentromeric regions that currently show scarce such information. The analysis of the biological processes GO enrichment associated to SbxSh genes showed 5,548 private genes that mapped to the important GO terms; 34 of these genes mapped to root system development (GO:0022622) two (GeneID:8054879 and GeneID:8072111) of which were reported to govern root properties 58,59 . Studies conducted in Arabidopsis thaliana showed that ROOT PRIMORDIUM DEFECTIVE 1 (RPD1; GeneID:8054879) is required for the maintenance of active cell proliferation and plays a critical role in the development of roots 58 , while the RETARDED ROOT GROWTH (RRG, GeneID:8072111) gene is predominantly expressed in the root meristem and encodes a mitochondrialocalized protein that is required for cell division in the root meristem (Xiaojing Zhou et al.). The variants and functional analyses conducted in this work showed that mining the SbxSh private variants and genes can provide insights on genetic factors controlling plant characteristics capable of boosting sorghum genetic improvement, particularly the perenniality trait that is compliant with agroecological practices, sustainable agriculture, and climate change resilience.

Conclusions
This work generated the first whole genome map of SNPs, indels, SVs, and CNVs in a sorghum panel that includes S. halepense genome, which can be used as a framework for future investigations in functional genomics and genome-assisted breeding. Sorghum is the world's fifth economically most important cereal and is a staple particularly in the semi-arid tropics of Africa and Asia, and is globally used for varied purposes including food, feed, fodder, commercial grade alcohol, as well as first, second, and third generation biofuels. The variants (SNPs, indels, SVs and CNVs) uncovered herein will boost genomic studies e.g., genomic prediction and selection, linkage and linkage disequilibrium analyses, molecular basis of several sorghum plant characteristics, all of which can build breakthroughs to achieve significant genetic gains in sorghum crop.

Materials and methods
Plant materials and sequence data sets. One hundred seventy-two sorghum genotypes including 19 S. bicolor × S. halepense (SbxSh) advanced inbred lines and 153 S. bicolor (Sb) lines were whole-genome resequenced. Sorghum bicolor genotypes were comprised of tropical landraces, improved lines therefrom, and temperate breeding lines. The SbxSh lines were at different levels (F 4 -F 7 ) of filial progeny generated from crosses involving annual/perennial (A/P) genotypes and A/P backcrosses to annual recurrent parents (A*2/P; BC1), perennial/perennial (P/P) and annual/perennial//perennial (A/P//P) hybrid combinations followed by cycles of selection. The annual parental lines were standard diploid (2n = 20), induced tetraploids (2n = 40), cytoplasmicgenetic male-sterile, and genetic male-sterile inbred sorghum lines. Perennial parental lines consisted of a S. halepense plant and tetraploid lines derived from controlled hybridization of induced tetraploid sorghum plants with S. halepense. These populations were described in Habyarimana et al. 8,14 . Interspecific hybridization techniques between Sb and Sh were recently described by Hodnett et al. 60 .
Total genomic DNA was extracted from 10-day old etiolated sorghum seedlings grown under standard glasshouse conditions, according to the cetyl trimethylammonium bromide method 61  Bioinformatics analyses. Sequence processing, mapping, and polymorphisms calling. The raw sequences were processed with the supplier's in-house SOAPnuke filter to obtain clean reads by discarding reads with more than 50% adaptor sequence, low quality reads for which more than 50% bases display Phred score less than 20, and reads with 2% or more "N" bases i.e., any base. The processed reads were then mapped to the reference genome of S. bicolor (BTx623) version 3.1.1 13 using Burrows-Wheeler Aligner (BWA) 64 . BWA showed good performance aligning relatively short nucleotide sequences against a long reference and producing accurate and fast SNP calling and annotation. To detect high-quality SNPs we first calculated the likelihood of each sample's genotype using SOAPsnp 70 and the genotype with the highest probability was selected as the genotype of the sequenced individual at the specific locus. Next, we selected a polymorphic locus against the reference sequence using the target consensus sequence, and based on the resequencing data of 172 samples, we determined SNPs located in effective sites with sufficient quality i.e., responding to the following criteria: 3 ≤ depth ≤ 50, with depth calculated using data from each individual, average mappable sites < 1.5, and an average quality for the novel allele > 20. The SNPs were localized in splice sites, start codons, stop codons, coding and noncoding regions, and other nucleic acid molecules based on annotated gene models in S. bicolor genome reference database 13 .
Short InDel detection. To identify short indel we mapped the paired-end reads to the reference sequence allowing up to 10-bp gaps, merged these redundant pairs, and gaps that were supported by at least three non-redundant paired-end reads were extracted. A potential indel was identified when the number of the un-gapped reads that crossed a potential indel was no more than twice that of the gapped reads. The final high-quality indels included only those identified on both strands by paired-end reads.
Structure variation detection. Structure variation (SV) includes deletion, insertion, duplication, inversion and transposition of long fragment (at least 50 bp) in genome. In this study, we used SOAPsv 64 to detect SVs based on the principle of paired-end 64 i.e., that one of the two reads of paired-end should align onto the forward chain, while the other should be aligned onto the negative (reverse) chain. In addition, the distance between the two reads after the alignment should amount the size of the insert, and pairs of two reads should have a normal orientation and a suitable span when aligned to the genome. Should the orientation or span of pairs of two reads be not consistent with alignment expectations, structural variations may be involved in that region. The abnormal paired-end alignments are analyzed by clustering and the result compared with predefined SV types. A threshold of 3 abnormal paired end reads is required to support the SV existence, while SVs that were supported by at least six paired end reads were considered of high quality and identified as the final SVs in this work.
Copy number variation detection. We detected CNVs by the following steps: (i) DNA sequences were separated into fragments according to the depth of each base from the alignment results; (ii) we calculated the P-value for each fragment to estimate its probability to be a CNV; and (iii) fragments that passed the criteria (fragment length longer than 2 kb, Pvalue ≤ 0.35, mean depth less than 0.5 or more than 2.0) were kept as CNVs. The P-value was calculated as the probability of each observed depth (d) under the distribution of a simulated Poisson distributed data set whose expected value (E(d)) equals the observed mean depth. If d < E(d), the P-value = P(x ≤ d) × 2, else P-value = P(x ≥ d) × 2. The credibility of a CNV is inversely proportional to the P-value.
Sorghum halepense-associated genes and gene ontology (GO) enrichment analysis. Gene ontology and in-depth molecular genetic investigation were carried out on one dual purpose Sb line derived from an improved tropical variety (IESV 99091 DL) and two sister perennial RILs (SbxSh9 and SbxSh50) derived from rhizome-growing SbxSh cross backcrossed (2 recurrent parent doses: Tx623*2/Gypsum 9; BC1) to cytoplasmic male sterile recurrent Sb parent, at more than six generations of selfing. A set of 12,484 single nucleotide polymorphism-containing genes identified in the two sister SbxSh RILs but not in IESV 99091 DL were selected as the candidate gene set associated with S. halepense (Fig. 4). These genes were mapped to gene ontology (GO) 26 to evaluate their characteristics, using PANTHER Gene List Analysis tools 71 . PANTHER takes a set of genes and compares the frequency of GO terms in the sample set with the frequency of the same set of GO terms in the reference set to identify terms that are over-or underrepresented in the sample set. In this work, we conducted PANTHER overrepresentation test using the GO Ontology database https:// doi. org/ 10. 5281/ zenodo. 47356 77 released May 01, 2021. The reference list consisted of Sorghum bicolor (all genes in database), while the annotation data sets were "GO molecular function complete", "GO biological process complete", and "GO cellular component complete", which are the datasets with the complete, up to date GO annotations. The binomial test 72 was used and the Bonferroni correction applied to account for multiple testing (one for each pathway, or each ontology term) at the same time. Only Bonferroni-corrected results with a probability level P < 0.05, were considered significant i.e., the lower the P value, the less likely the obtained result can be explained by random distribution.
Genomic mapping of biomass yield and biomass relevant traits loci in the evaluated population. The information from Habyarimana et al. 8 was used in this work for genomic physical mapping of biomass related SNPs, candidate genes, and genes known to underpin sorghum plant height and maturity. The transcripts of known genes were identified on phytozome 63 . In their work Habyarimana et al. 8 , genome-wide association study was www.nature.com/scientificreports/ performed using the statistical genetics package Genome Association and Prediction Integrated Tool (GAPIT) 73 within the R environment 74 . In addition, two multi-locus GWAS algorithms were used to identify significant quantitative trait loci (QTLs) for the biomass-related traits: BLINK (Bayesian-information and Linkage disequilibrium Iteratively Nested Keyway) 75,76 and SUPER (Settlement of MLM Under Progressively Exclusive Relationship) 77 .
Statistics. Statistical inferences to separate means e.g., sequencing statistics, were carried out using analysis of variance and Tukey HSD test at the 5% significance level. Genetic diversity was evaluated using the Fstatistic and the principal coordinates analysis 78 . Statistical inferences and data visualization were carried out using R software 74 .

Data availability
The datasets generated during and/or analyzed during the current study are not publicly available due to planned near future use, but are available from the first corresponding author on reasonable request.